INTERSPEECH.2011 - Speech Synthesis

Total: 50

#1 Decision tree-based clustering with outlier detection for HMM-based speech synthesis

Authors: Kyung Hwan Oh ; June Sig Sung ; Doo Hwa Hong ; Nam Soo Kim

In order to express natural prosodic variations in continuous speech, sophisticated speech units such as context-dependent phone models are usually employed in HMM-based speech synthesis. Since the training database cannot practically cover all possible context factors, decision tree-based HMM state clustering is commonly applied. A serious problem with decision tree-based methods is that the criterion used for node splitting and stopping is sensitive to irrelevant outlier data. In this paper, we propose a novel approach to removing outliers during the decision tree growing phase. Experimental results show that removing outlying models improves the quality of the synthesized speech, especially for sentences which originally demonstrated poor quality.
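
The abstract does not spell out the outlier criterion. As a rough, hypothetical sketch of the idea, the snippet below trims low-likelihood samples from a node before evaluating a standard maximum-likelihood split gain; the z-score threshold and all names are illustrative assumptions, not the authors' method.

```python
import numpy as np

def gaussian_loglik(x, mean, var):
    """Per-sample log-likelihood under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=1)

def outlier_mask(data, z_thresh=3.0):
    """Keep-mask: drop samples whose log-likelihood under the node's own
    Gaussian falls more than z_thresh std. dev. below the mean log-likelihood
    (criterion and threshold are illustrative, not the paper's)."""
    mean, var = data.mean(axis=0), data.var(axis=0) + 1e-8
    ll = gaussian_loglik(data, mean, var)
    return ll > ll.mean() - z_thresh * ll.std()

def split_gain(data, answers, z_thresh=3.0):
    """Log-likelihood gain of splitting a node by a yes/no context question,
    evaluated after outlier removal."""
    keep = outlier_mask(data, z_thresh)
    data, answers = data[keep], answers[keep]
    def node_ll(d):
        m, v = d.mean(axis=0), d.var(axis=0) + 1e-8
        return gaussian_loglik(d, m, v).sum()
    yes, no = data[answers], data[~answers]
    if len(yes) < 2 or len(no) < 2:
        return -np.inf
    return node_ll(yes) + node_ll(no) - node_ll(data)
```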

#2 Prediction of voice aperiodicity based on spectral representations in HMM speech synthesis

Authors: Hanna Silén ; Elina Helander ; Moncef Gabbouj

In hidden Markov model-based speech synthesis, speech is typically parameterized using source-filter decomposition. A widely used analysis/synthesis framework, STRAIGHT, decomposes the speech waveform into a framewise spectral envelope and a mixed-mode excitation signal. Inclusion of an aperiodicity measure in the model also enables synthesis of signals that are not purely voiced or unvoiced. In the traditional approach employing hidden Markov modeling and decision tree-based clustering, the connection between the speech spectrum and aperiodicities is not taken into account. In this paper, we take advantage of this dependency and predict voice aperiodicities afterwards, based on the synthetic spectral representations. Evaluations carried out on English data confirm that the proposed approach provides prediction accuracy comparable to the traditional approach.

#3 A perceptual expressivity modeling technique for speech synthesis based on multiple-regression HSMM

Authors: Takashi Nose ; Takao Kobayashi

This paper describes a technique for modeling and controlling the emotional expressivity of speech in HMM-based speech synthesis. A problem with conventional HMM-based emotional speech synthesis is that the intensity of an emotional expression appearing in synthetic speech depends completely on the database used for model training. To take into account the emotional expressivity that listeners actually perceive, perceptual expressivity scores are introduced into a style control technique based on the multiple-regression hidden semi-Markov model (MRHSMM). Objective and subjective evaluation results show that the proposed technique works well when there is a large bias of emotional expressivity in the training data.
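
In an MRHSMM, each state's mean vector is expressed as a linear function of a low-dimensional style (here, expressivity) vector, which is what makes continuous control possible. A minimal sketch of that control mechanism, with purely illustrative numbers:

```python
import numpy as np

def mrhsmm_mean(H, style):
    """Mean of a state output distribution in a multiple-regression HSMM:
    mu(s) = H @ [1, s_1, ..., s_L], where s is a low-dimensional style
    (expressivity) vector and H is the per-state regression matrix."""
    return H @ np.concatenate(([1.0], style))

# Hypothetical example: one state, 3-dim observation, 2-dim style space
# (e.g. perceived intensity of two emotions on a [-1, 1] scale).
H = np.array([[1.0,  0.3, -0.1],
              [0.5,  0.0,  0.2],
              [2.0, -0.4,  0.1]])
print(mrhsmm_mean(H, np.array([0.8, 0.0])))   # strongly expressive
print(mrhsmm_mean(H, np.array([0.0, 0.0])))   # neutral
```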

#4 Multi-speaker modeling with shared prior distributions and model structures for Bayesian speech synthesis

Authors: Kei Hashimoto ; Yoshihiko Nankaku ; Keiichi Tokuda

This paper investigates a multi-speaker modeling technique with shared prior distributions and model structures for Bayesian speech synthesis. The quality of synthesized speech is improved by selecting appropriate model structures in HMM-based speech synthesis, and the Bayesian approach is known to work well for such model selection. However, the result is strongly affected by the prior distributions of the model parameters. Therefore, the determination of prior distributions and the selection of model structures should be performed simultaneously. This paper investigates prior distributions and model structures in the situation where training data from multiple speakers are available. Prior distributions and model structures which represent acoustic features common to all speakers can be obtained by sharing them among multiple speaker-dependent models.

#5 Feature-space transform tying in unified acoustic-articulatory modelling for articulatory control of HMM-based speech synthesis

Authors: Zhen-Hua Ling ; Korin Richmond ; Junichi Yamagishi

In previous work, we proposed a method to control the characteristics of synthetic speech flexibly by integrating articulatory features into hidden Markov model (HMM) based parametric speech synthesis. A unified acoustic-articulatory model was trained, and a piecewise linear transform was adopted to describe the dependency between these two feature streams. The transform matrices were trained for each HMM state and were tied based on each state's context. In this paper, an improved acoustic-articulatory modelling method is proposed. A Gaussian mixture model (GMM) is introduced to model the articulatory space, and the cross-stream transform matrices are trained for each Gaussian mixture component instead of context-dependently. This means the dependency relationship can vary flexibly with the articulatory features. Our results show this method improves the effectiveness of control over vowel quality by modifying articulatory trajectories, without degrading naturalness.
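
A minimal sketch of the GMM-gated piecewise linear mapping described above: the acoustic mean given an articulatory vector is a posterior-weighted sum of per-mixture linear transforms. The diagonal-Gaussian GMM and the names A and b are illustrative assumptions.

```python
import numpy as np

def gmm_posteriors(y, weights, means, variances):
    """Posterior p(m|y) of each diagonal-Gaussian mixture component over
    the articulatory space, for articulatory vector y."""
    logp = [np.log(w) - 0.5 * np.sum(np.log(2 * np.pi * v) + (y - mu) ** 2 / v)
            for w, mu, v in zip(weights, means, variances)]
    logp = np.array(logp)
    p = np.exp(logp - logp.max())           # stable softmax over components
    return p / p.sum()

def acoustic_mean(y, gmm, A, b):
    """Piecewise linear articulatory-to-acoustic mapping, gated by the
    articulatory GMM: mean_x(y) = sum_m p(m|y) (A_m y + b_m)."""
    post = gmm_posteriors(y, *gmm)
    return sum(p * (A[m] @ y + b[m]) for m, p in enumerate(post))
```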

#6 The effect of using normalized models in statistical speech synthesis

Authors: Matt Shannon ; Heiga Zen ; William Byrne

The standard approach to HMM-based speech synthesis is inconsistent in the enforcement of the deterministic constraints between static and dynamic features. The trajectory HMM and autoregressive HMM have been proposed as normalized models which rectify this inconsistency. This paper investigates the practical effects of using these normalized models, and examines the strengths and weaknesses of the different models as probabilistic models of speech. The most striking difference observed is that the standard approach greatly underestimates predictive variance. We argue that the normalized models have better predictive distributions than the standard approach, but that all the models we consider are still far from satisfactory probabilistic models of speech. We also present evidence that better intra-frame correlation modelling goes some way towards improving existing normalized models.
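
The "deterministic constraints" referred to here are the window relations tying dynamic features to the static trajectory, o = Wc. For context, below is a minimal single-stream version of the standard (unnormalized) parameter generation step, which maximizes the HMM output likelihood subject to that constraint; the simple delta window is an assumption for illustration.

```python
import numpy as np

def mlpg(mu, var, delta_win=(-0.5, 0.0, 0.5)):
    """Maximum-likelihood parameter generation for one 1-D feature track.
    mu/var (length 2T) hold static then delta means/variances per frame;
    W stacks the identity (statics) and the delta window so that o = W c.
    Solves c = (W' S^-1 W)^-1 W' S^-1 mu with diagonal S."""
    T = len(mu) // 2
    W = np.zeros((2 * T, T))
    W[:T, :] = np.eye(T)                        # static rows
    for t in range(T):                          # delta rows (clamped edges)
        for k, w in enumerate(delta_win):
            tau = min(max(t + k - 1, 0), T - 1)
            W[T + t, tau] += w
    P = np.diag(1.0 / np.asarray(var))          # inverse covariance
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ np.asarray(mu))

T = 5
mu = np.concatenate([np.linspace(0.0, 1.0, T), np.zeros(T)])  # statics, deltas
print(mlpg(mu, np.ones(2 * T)))                 # smooth generated trajectory
```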

#7 Continuous control of the degree of articulation in HMM-based speech synthesis

Authors: Benjamin Picart ; Thomas Drugman ; Thierry Dutoit

This paper focuses on the implementation of a continuous control of the degree of articulation (hypo/hyperarticulation) in the framework of HMM-based speech synthesis. The adaptation of a neutral speech synthesizer to generate hypo- and hyperarticulated speech using a limited amount of speech data is studied first. This is done using inter-speaker voice adaptation techniques, applied here to intra-speaker voice adaptation. The implementation of a continuous control of the degree of articulation is then proposed in a second step. Finally, a subjective evaluation shows that good quality neutral/hypo/hyperarticulated speech, as well as any intermediate, interpolated or extrapolated articulation degree, can be obtained from an HMM-based speech synthesizer.
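
The abstract does not give the exact control formula, but continuous control of this kind is commonly realized by interpolating or extrapolating model parameters between a neutral and an adapted (hyperarticulated) model. A hypothetical sketch on Gaussian means:

```python
import numpy as np

def interpolate_models(mu_neutral, mu_hyper, alpha):
    """Continuous articulation-degree control by interpolating
    (0 < alpha < 1) or extrapolating (alpha < 0 or alpha > 1) between the
    Gaussian means of a neutral and a hyperarticulated model (a common
    model-interpolation scheme; the paper's formulation may differ)."""
    return (1.0 - alpha) * mu_neutral + alpha * mu_hyper

mu_n, mu_h = np.array([1.0, 2.0]), np.array([1.4, 2.6])
print(interpolate_models(mu_n, mu_h, 0.5))    # intermediate degree
print(interpolate_models(mu_n, mu_h, -0.5))   # extrapolated hypoarticulation
```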

#8 Estimation of window coefficients for dynamic feature extraction for HMM-based speech synthesis

Authors: Ling-Hui Chen ; Yoshihiko Nankaku ; Heiga Zen ; Keiichi Tokuda ; Zhen-Hua Ling ; Li-Rong Dai

In standard approaches to hidden Markov model (HMM)-based speech synthesis, the window coefficients for calculating dynamic features are pre-determined and fixed. This may not be optimal for capturing the various context-dependent dynamic characteristics of speech signals. This paper proposes a data-driven technique to estimate the window coefficients: they are optimized so as to maximize the likelihood of trajectory HMMs given the data. Experimental results show that the proposed technique achieves naturalness of synthesized speech comparable to mean- and variance-updated trajectory HMMs, while offering significantly lower computational cost.

#9 Inverse filtering based harmonic plus noise excitation model for HMM-based speech synthesis

Authors: Zhengqi Wen ; Jianhua Tao

In this paper, a new Voicing Cut-Off Frequency (VCO) estimation method based on inverse filtering is presented. The spectrum of the residual signal obtained from inverse filtering is split into sub-bands, which are clustered into two classes using the K-means algorithm. The Viterbi algorithm is then used to search for a smooth VCO contour. Based on this new VCO estimation method, an adaptation of the Harmonic plus Noise Model is also proposed to reconstruct the residual signal with both harmonic and noise components. The proposed excitation model can reduce the buzziness of speech generated by conventional vocoders that use a simple pulse train, and has been integrated into an HMM-based speech synthesis system (HTS). A listening test showed that HTS with our new method gives better synthesized speech quality than traditional HTS, which uses only a simple pulse-train excitation model.
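
A simplified sketch of the two-stage procedure described above, using scalar log band energies in place of full sub-band spectra; the cost definition and jump penalty are illustrative assumptions:

```python
import numpy as np

def estimate_vco(band_energy, trans_penalty=2.0):
    """VCO estimation sketch. band_energy: (T, B) log energies of the
    residual sub-bands (T frames, B bands).
    Stage 1: 2-class K-means -> 'harmonic' vs. 'noise' bands.
    Stage 2: Viterbi over per-frame cutoff candidates 0..B with a
    quadratic jump penalty, yielding a smooth VCO contour."""
    T, B = band_energy.shape
    flat = band_energy.ravel()
    # --- minimal 1-D 2-means ---
    c = np.array([flat.min(), flat.max()], dtype=float)
    for _ in range(20):
        lab = np.abs(flat - c[0]) > np.abs(flat - c[1])   # True -> class 1
        for k, mask in enumerate((~lab, lab)):
            if mask.any():
                c[k] = flat[mask].mean()
    harmonic = (lab if c[1] > c[0] else ~lab).reshape(T, B)
    # --- local cost: bands below cutoff v should be harmonic, above noise
    cost = np.empty((T, B + 1))
    for v in range(B + 1):
        cost[:, v] = (~harmonic[:, :v]).sum(1) + harmonic[:, v:].sum(1)
    # --- Viterbi smoothing over cutoff candidates ---
    states = np.arange(B + 1)
    jump = trans_penalty * (states[:, None] - states[None, :]) ** 2
    D, back = cost[0].copy(), np.zeros((T, B + 1), dtype=int)
    for t in range(1, T):
        scores = D[:, None] + jump             # scores[prev, cur]
        back[t] = scores.argmin(axis=0)
        D = scores.min(axis=0) + cost[t]
    path = [int(D.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return np.asarray(path[::-1])              # cutoff band index per frame
```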

#10 Improved HNM-based vocoder for statistical synthesizers

Authors: Daniel Erro ; Iñaki Sainz ; Eva Navas ; Inma Hernáez

Statistical parametric synthesizers have achieved very good performance scores in recent years. Nevertheless, as they require vocoders to parameterize speech (during training) and to reconstruct waveforms (during synthesis), the speech generated from statistical models lacks some degree of naturalness. In previous work we explored the usefulness of the harmonics plus noise model in the design of a high-quality speech vocoder. Quite promising results were achieved when this vocoder was integrated into a synthesizer. In this paper, we describe some recent improvements related to the excitation parameters, particularly the so-called maximum voiced frequency. Its estimation and explicit modelling lead to even better synthesis performance, as confirmed by subjective comparisons with other well-known methods.

#11 A statistical phrase/accent model for intonation modeling

Authors: Gopala Krishna Anumanchipalli ; Luís C. Oliveira ; Alan W. Black

This paper proposes a statistical phrase/accent model of voice fundamental frequency (F0) for speech synthesis. It presents an approach for automatically extracting and modeling phrase and accent phenomena from F0 contours by taking into account their overall trends in the training data. An iterative optimization algorithm is described to extract these components, minimizing the reconstruction error of the F0 contour. This method of modeling local and global components of F0 separately is shown to be better than conventional F0 models used in Statistical Parametric Speech Synthesis (SPSS). Perceptual evaluations confirm that the proposed model is significantly better than baseline SPSS F0 models in three prosodically diverse tasks: read speech, radio broadcast speech, and audiobook speech.
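
The abstract does not detail the optimization. As a toy illustration of the separate local/global modeling idea, the sketch below alternates between fitting a low-order polynomial phrase trend and taking the non-negative residual as the accent component; the polynomial phrase model is an assumption for illustration only, not the authors' parameterization.

```python
import numpy as np

def decompose_f0(logf0, phrase_order=2, n_iter=5):
    """Illustrative phrase/accent decomposition of a (voiced-only) log-F0
    contour: the phrase component is a low-order polynomial trend, the
    accent component the non-negative residual, refined alternately so
    their sum approximates the contour."""
    t = np.linspace(0.0, 1.0, len(logf0))
    accent = np.zeros_like(logf0)
    for _ in range(n_iter):
        # Fit the phrase trend to what the accents do not explain.
        coeff = np.polyfit(t, logf0 - accent, phrase_order)
        phrase = np.polyval(coeff, t)
        # Accents are the non-negative part of the remaining residual.
        accent = np.maximum(logf0 - phrase, 0.0)
    recon_err = np.mean((phrase + accent - logf0) ** 2)
    return phrase, accent, recon_err
```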

#12 Intermediate-state HMMs to capture continuously-changing signal features

Authors: Gustav Eje Henter ; W. Bastiaan Kleijn

Traditional discrete-state HMMs are not well suited to describing steadily evolving, path-following natural processes such as motion capture data or speech. HMMs cannot represent incremental progress between behaviors, and sequences sampled from the models have unnatural segment durations, unsmooth transitions, and excessively rapid variation. We propose to address these problems by permitting the state variable to occupy positions between the discrete states, and present a concrete left-right model incorporating this idea. We call these intermediate-state HMMs. The state evolution remains Markovian. We describe training using the generalized EM algorithm and present the associated update formulas. An experiment shows that the intermediate-state model is capable of gradual transitions, with more natural durations and less noise in sampled sequences compared to a conventional HMM.
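
One plausible reading of "positions between the discrete states" is that the emission distribution at a fractional position interpolates the neighbouring states' Gaussians; the paper's exact parameterization may differ. A sketch under that assumption:

```python
import numpy as np

def intermediate_emission(position, means, variances):
    """Emission parameters at a fractional state position in a left-right
    model: linear interpolation between the two neighbouring discrete
    states' Gaussians (an illustrative assumption)."""
    lo = int(np.floor(position))
    hi = min(lo + 1, len(means) - 1)
    frac = position - lo
    mean = (1 - frac) * means[lo] + frac * means[hi]
    var = (1 - frac) * variances[lo] + frac * variances[hi]
    return mean, var

means = np.array([0.0, 1.0, 3.0])      # three discrete left-right states
variances = np.array([0.5, 0.5, 0.5])
print(intermediate_emission(1.25, means, variances))  # 25% from state 1 to 2
```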

#13 Automatic sentence selection from speech corpora including diverse speech for improved HMM-TTS synthesis quality

Authors: Norbert Braunschweiler ; Sabine Buchholz

Using publicly available audiobooks for HMM-TTS poses new challenges. This paper addresses the issue of diverse speech in audiobooks. The aim is to identify diverse speech likely to have a negative effect on HMM-TTS quality. Manual removal of diverse speech was found to yield better synthesis quality despite halving the training corpus. To handle large amounts of data, an automatic approach is proposed which uses a small set of acoustic and text-based features. A series of listening tests showed that the manual selection was most preferred, while the automatic selection was significantly preferred over the full training set.

#14 Phonological knowledge guided HMM state mapping for cross-lingual speaker adaptation

Authors: Hui Liang ; John Dines

Within the HMM state mapping-based cross-lingual speaker adaptation framework, the minimum Kullback-Leibler divergence criterion has typically been employed to measure the similarity of two average-voice state distributions from the two respective languages for state mapping construction. Considering that this simple criterion does not take any language-specific information into account, we propose a data-driven, phonological knowledge guided approach to strengthen the mapping construction: state distributions from the two languages are clustered according to broad phonetic categories using decision trees, and mapping rules are constructed only within each of the clusters. Objective evaluation of our proposed approach demonstrates a reduction of mel-cepstral distortion and that mapping rules derived from a single training speaker generalize to other speakers, with a subtle improvement detected in subjective listening tests.
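
A small sketch of the proposed phonologically guided mapping: each source-language state is mapped to the minimum-KLD target-language state, with candidates restricted to the same broad phonetic class. The diagonal-Gaussian KLD and the data layout are illustrative assumptions.

```python
import numpy as np

def kld_gaussian(mu0, var0, mu1, var1):
    """KL divergence KL(p||q) between two diagonal-Gaussian state
    distributions p=(mu0,var0) and q=(mu1,var1)."""
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def build_mapping(states_l1, states_l2):
    """Map each L1 state to the minimum-KLD L2 state, searching only
    within the same broad phonetic class (e.g. 'vowel', 'nasal').
    Each state is a tuple (broad_class, mean, variance)."""
    mapping = {}
    for i, (cls1, mu1, v1) in enumerate(states_l1):
        candidates = [(kld_gaussian(mu1, v1, mu2, v2), j)
                      for j, (cls2, mu2, v2) in enumerate(states_l2)
                      if cls2 == cls1]        # phonological guidance
        mapping[i] = min(candidates)[1] if candidates else None
    return mapping
```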

#15 Reformulating prosodic break model into segmental HMMs and information fusion

Authors: Nicolas Obin ; Pierre Lanchantin ; Anne Lacheret ; Xavier Rodet

In this paper, a method for prosodic break modelling based on segmental HMMs and Dempster-Shafer fusion for speech synthesis is presented, and the relative importance of linguistic and metric constraints in prosodic break modelling is assessed. A context-dependent segmental HMM is used to explicitly model the linguistic and the metric constraints. Dempster-Shafer fusion is used to balance the relative importance of the linguistic and the metric constraints within the segmental HMM. A linguistic processing chain based on surface and deep syntactic parsing is additionally used to extract linguistic information of different kinds. An objective evaluation provided evidence that the optimal combination of the linguistic and the metric constraints significantly outperforms both the conventional HMM (linguistic information only) and the segmental HMM (equal balance of linguistic and metric constraints), and confirmed that the linguistic constraint takes priority over the metric one.
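
Dempster's rule of combination, the fusion mechanism named above, can be stated compactly for a two-hypothesis frame (break vs. no break). The mass values below are illustrative, not the paper's:

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination for two basic belief assignments
    over the frame {'break', 'nobreak'}; mass on the full frame carries
    ignorance. Conflicting mass is discarded and renormalized."""
    combined, conflict = {}, 0.0
    for a, va in m1.items():
        for b, vb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + va * vb
            else:
                conflict += va * vb
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

linguistic = {frozenset(['break']): 0.6, frozenset(['break', 'nobreak']): 0.4}
metric     = {frozenset(['nobreak']): 0.3, frozenset(['break', 'nobreak']): 0.7}
print(dempster_combine(linguistic, metric))
```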

#16 Multipulse sequences for residual signal modeling

Authors: Ranniery Maia ; Heiga Zen ; Kate Knill ; M. J. F. Gales ; Sabine Buchholz

In source-filter models of speech production, the residual signal (what remains after passing the speech signal through the inverse filter) contains important information for the generation of natural-sounding re-synthesized speech. Typically, the voiced regions of residual signals are regarded as a mixture of glottal pulses and noise. This paper introduces a novel approach to represent the noise component of the voiced regions of residual signals through autoregressive filtering of multipulse sequences. The positions and amplitudes of the non-zero samples of these multipulse signals are optimized through a closed-loop procedure. The method is applied to excitation modeling in statistical parametric synthesis. Experimental results indicate that the multipulse-based noise component construction eliminates the need for run-time ad hoc procedures such as high-pass filtering and time modulation, common in excitation models for statistical parametric synthesizers, with no loss of synthesized speech quality.
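
The closed-loop optimization is not detailed in the abstract. The sketch below shows a textbook greedy multipulse analysis-by-synthesis loop, which picks pulse positions and amplitudes one at a time to reduce the squared error of the filtered pulse train against a target; it conveys the flavour of the approach rather than the paper's exact procedure.

```python
import numpy as np

def multipulse_fit(target, n_pulses, h):
    """Greedy multipulse fitting: at each step, choose the pulse
    position/amplitude whose filtered contribution (h is the filter's
    impulse response) best reduces the squared error to the target.
    target and h are 1-D float arrays."""
    residual = target.astype(float).copy()
    pulses = np.zeros_like(residual)
    for _ in range(n_pulses):
        # Correlate the impulse response against the current error:
        # corr[p] = <residual[p:p+len(h)], h>.
        corr = np.correlate(residual, h, mode="full")[len(h) - 1:]
        energy = np.sum(h ** 2)
        pos = int(np.argmax(np.abs(corr)))
        amp = corr[pos] / energy            # optimal amplitude at pos
        contrib = np.zeros_like(residual)
        end = min(len(residual), pos + len(h))
        contrib[pos:end] = amp * h[:end - pos]
        pulses[pos] += amp
        residual -= contrib
    return pulses
```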

#17 Can objective measures predict the intelligibility of modified HMM-based synthetic speech in noise?

Authors: Cassia Valentini-Botinhao ; Junichi Yamagishi ; Simon King

Synthetic speech can be modified to improve its intelligibility in noise. In order to perform such modifications automatically, it would be useful to have an objective measure that could predict the intelligibility of modified synthetic speech for human listeners. We analysed the impact on intelligibility, and on how well objective measures predict it, when we separately modify speaking rate, fundamental frequency, line spectral pairs (LSPs) and spectral peaks. Shifting LSPs can increase intelligibility for human listeners; the other modifications had weaker effects. Among the objective measures we evaluated, the Dau model and the Glimpse proportion were the best predictors of human performance.
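
For reference, the Glimpse proportion mentioned above is commonly computed as the fraction of spectro-temporal regions where the speech energy exceeds the noise by a local SNR threshold. A minimal sketch; the threshold value and the spectrogram representation are assumptions:

```python
import numpy as np

def glimpse_proportion(speech_spec, noise_spec, local_snr_db=3.0):
    """Fraction of time-frequency cells where speech exceeds the noise
    by local_snr_db. Inputs are positive power spectrograms of the same
    shape (e.g. auditory filterbank outputs)."""
    snr = 10.0 * np.log10(speech_spec / noise_spec)
    return float(np.mean(snr > local_snr_db))
```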

#18 Speech synthesis based on articulatory-movement HMMs with voice-source codebooks

Authors: Tsuneo Nitta ; Takayuki Onoda ; Masashi Kimura ; Yurie Iribe ; Kouichi Katsurada

Speech synthesis based on a single set of articulatory-movement HMMs, applicable to both speech recognition (SR) and speech synthesis (SS), is described. In the SS module, speaker-invariant HMMs are applied to generate an articulatory feature (AF) sequence; then, after converting the AFs into vocal tract parameters using a multilayer neural network (MLN), a speech signal is synthesized through an LSP digital filter. The CELP coding technique is applied to improve the voice sources generated from codes embedded in the corresponding HMM states. The proposed SS module separates phonetic information from speaker individuality, so the target speaker's voice can be synthesized with a small amount of speech data. In the experiments, we carried out listening tests with ten subjects and evaluated both the sound quality and the speaker individuality of the synthesized speech. The results confirmed that the proposed SS module can produce good-quality speech of the target speaker even when trained on a data set of only two sentences.

#19 Large-scale subjective evaluations of speech rate control methods for HMM-based speech synthesizers

Authors: Tsuneo Kato ; Makoto Yamada ; Nobuyuki Nishizawa ; Keiichiro Oura ; Keiichi Tokuda

Three speech rate control methods for HMM-based speech synthesis were compared in large-scale subjective evaluations. The methods are: 1) synthesizing speech from HMMs trained on corpora at the target speech rate, 2) stretching or shrinking utterance durations proportionally during waveform generation, and 3) determining state durations under the ML criterion subject to a constraint on the total utterance duration. The results indicated that proportional shrinking had significant advantages at fast rates, whereas HMMs trained on slow speech had a slight advantage at slow rates. We also found an advantage for proportionally shrunk speech from a synthesizer trained on slow speech corpora.
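
Method 3 corresponds to the well-known constrained ML duration formula for HMM-based synthesis: each state duration is its mean shifted in proportion to its variance, d_i = m_i + rho * sigma_i^2, with rho chosen so the total matches the target length. A minimal sketch with illustrative numbers:

```python
import numpy as np

def constrained_durations(means, variances, total_frames):
    """ML state durations under a total-utterance-duration constraint:
    d_i = m_i + rho * sigma_i^2, rho = (T - sum(m_i)) / sum(sigma_i^2).
    Rounding may perturb the total by a frame or so."""
    rho = (total_frames - means.sum()) / variances.sum()
    d = means + rho * variances
    return np.maximum(np.rint(d), 1).astype(int)   # at least one frame each

mu  = np.array([10.0, 25.0, 15.0])   # per-state duration means (frames)
var = np.array([4.0, 16.0, 9.0])     # per-state duration variances
print(constrained_durations(mu, var, total_frames=40))  # faster speech
```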

#20 HMM-based emphatic speech synthesis using unsupervised context labeling

Authors: Yu Maeno ; Takashi Nose ; Takao Kobayashi ; Yusuke Ijima ; Hideharu Nakajima ; Hideyuki Mizuno ; Osamu Yoshioka

This paper describes an approach to HMM-based expressive speech synthesis which does not require any supervised labeling of emphasis contexts. We use appealing-style speech whose sentences were taken from real domains. To reduce the cost of labeling speech data with emphasis contexts for model training, we propose an unsupervised labeling technique based on the difference between the original and generated F0 patterns of the training sentences. Although the criterion for emphasis labeling is quite simple, subjective evaluation results reveal that the unsupervised labeling is comparable to careful labeling by a human in terms of speech naturalness and emphasis reproducibility.
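
A minimal sketch of the unsupervised criterion as described: regions where the natural log-F0 exceeds the contour generated under neutral context labels by more than a threshold are marked as emphasized. The threshold value and the frame-level granularity are assumptions:

```python
import numpy as np

def label_emphasis(f0_original, f0_generated, threshold=0.15):
    """Unsupervised emphasis labeling: mark (voiced) frames where the
    natural log-F0 exceeds the model-generated contour by more than a
    threshold. Inputs are positive F0 values over voiced frames only."""
    diff = np.log(f0_original) - np.log(f0_generated)
    return diff > threshold        # boolean emphasis label per frame
```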

#21 Enriching text-to-speech synthesis using automatic dialog act tags

Authors: Vivek Kumar Rangarajan Sridhar ; Ann Syrdal ; Alistair D. Conkie ; Srinivas Bangalore

We present an approach for enriching dialog based text-to-speech (TTS) synthesis systems by explicitly controlling the expressiveness through the use of dialog act tags. The dialog act tags in our framework are automatically obtained by training a maximum entropy classifier on the Switchboard-DAMSL data set, unrelated to the TTS database. We compare the voice quality produced by exploiting automatic dialog act tags with that using human annotations of dialog acts, and with two forms of reference databases. Even though the inventory of tags is different for the automatic tagger and human annotation, exploiting either form of dialog markup generates better voice quality in comparison with the reference voices in subjective evaluation.
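
A maximum entropy classifier over utterance features is equivalent to multinomial logistic regression; a toy sketch with scikit-learn, using made-up utterances and tags standing in for the Switchboard-DAMSL training data:

```python
# Maximum-entropy (multinomial logistic regression) dialog-act tagger sketch.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_utts = ["are you coming tomorrow", "yes absolutely", "uh-huh",
              "close the door please", "I grew up in texas"]
train_tags = ["question", "agreement", "backchannel", "command", "statement"]

tagger = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                       LogisticRegression(max_iter=1000))
tagger.fit(train_utts, train_tags)
print(tagger.predict(["did you see that"]))   # e.g. ['question']
```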

#22 Joint target and join cost weight training for unit selection synthesis

Authors: Lukas Latacz ; Wesley Mattheyses ; Werner Verhelst

One of the key challenges in optimizing a unit selection voice is obtaining suitable target and join cost weights. In this paper we investigate several strategies to train these weights automatically. Two training algorithms are tested, both based on an acoustic distance that approximates human perception: a modified version of the well-known linear regression training, and an iterative algorithm that tries to minimize a selection error. Since a single, global set of weights might not always result in selecting the best sequence of units, we also investigate whether using multiple weight sets can improve synthesis quality.
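
The linear-regression variant can be sketched as ordinary least squares: find weights such that the weighted sum of sub-costs best approximates the perceptually motivated acoustic distance. The synthetic data and non-negativity-by-clipping are simplifying assumptions:

```python
import numpy as np

def train_cost_weights(subcosts, acoustic_dist):
    """Linear-regression weight training: least-squares weights w so that
    subcosts @ w approximates an acoustic distance standing in for human
    perception, clipped to be non-negative."""
    w, *_ = np.linalg.lstsq(subcosts, acoustic_dist, rcond=None)
    return np.maximum(w, 0.0)

rng = np.random.default_rng(0)
C = rng.random((200, 4))                 # 4 sub-costs for 200 candidate joins
d = C @ np.array([0.5, 1.5, 0.0, 0.8]) + 0.05 * rng.standard_normal(200)
print(train_cost_weights(C, d))          # recovers roughly the true weights
```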

#23 Prominence-based prosody prediction for unit selection speech synthesis

Authors: Andreas Windmann ; Igor Jauk ; Fabio Tamburini ; Petra Wagner

This paper describes the development and evaluation of a prosody prediction module for unit selection speech synthesis that is based on the notion of perceptual prominence. We outline the design principles of the module and describe its implementation in the Bonn Open Synthesis System (BOSS). Moreover, we report results of perception experiments that have been conducted in order to evaluate prominence prediction. The paper is concluded by a general discussion of the approach and a sketch of perspectives for further work.

#24 Evaluating the meaning of synthesized listener vocalizations

Authors: Sathish Pammi ; Marc Schröder

Spoken and multimodal dialogue systems start to use listener vocalizations for more natural interaction. In a unit selection framework, using a finite set of recorded listener vocalizations, synthesis quality is high but the acoustic variability is limited. As a result, many combinations of segmental form and intended meaning cannot be synthesized.

#25 A hybrid TTS approach for prosody and acoustic modules

Authors: Iñaki Sainz ; Daniel Erro ; Eva Navas ; Inma Hernáez

Unit selection (US) TTS systems generate quite natural speech, but of highly variable quality. Statistical parametric (SP) systems offer far more consistent quality, but reduced naturalness due to their vocoding nature. We present a hybrid approach (HA) that tries to improve overall naturalness by combining both synthesis methods. Contrary to other works, the fusion of methods is performed in both the prosody and acoustic modules, yielding a more robust prosody prediction and achieving greater naturalness. Objective and subjective experiments show the validity of our procedure.